Systems and Methods for Data Analysis

ABSTRACT

Described herein are methods and systems for hierarchically mapping, ranking, and labeling data sets automatically. Also provided are methods for browsing and navigating a hierarchically mapped data set, and identifying changes in network structure over time. An example method may involve receiving document data indicating a corpus of documents and references between documents within the corpus. Based on the document data, a network comprising two or more nodes and at least one directed edge may be determined. Also, a hierarchical partition of the documents may be determined based on the directed edges of the network. The hierarchical partition may define a plurality of nested modules, and each module in the plurality of nested modules may be associated with one or more respective documents within the corpus. The method may additionally include causing a graphical display to provide a visual indication of one or more of the plurality of nested modules.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 14/371,364, filed Jul. 9, 2014, which is a U.S. National Phase of International Application No. PCT/US2013/024517, filed Feb. 1, 2013, which claims priority to U.S. Provisional Application No. 61/593,761, filed Feb. 1, 2012; U.S. Provisional Application No. 61/593,749, filed Feb. 1, 2012; U.S. Provisional Application No. 61/723,309, filed Nov. 6, 2012; and U.S. Provisional Application No. 61/722,955, filed Nov. 6, 2012, the disclosures of all of which are hereby incorporated by reference in their entireties.

STATEMENT OF U.S. GOVERNMENT INTEREST

This invention was made with Government Support under grant U54GM088588, awarded by the National Institutes of Health, and grant SBE0915005, awarded by the National Science Foundation. The government has certain rights in the invention.

FIELD OF THE INVENTION

The disclosure herein relates generally to automatically classifying data sets, and in particular, to methods and systems for hierarchically classifying, ranking, labeling, and navigating data sets.

BACKGROUND

Unless otherwise indicated herein, the materials described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.

Organizing and classifying complex relational data can provide useful information for comprehending the structure of large integrated systems. For example, community-detection algorithms may be used to identify the presence of multiple sub-structures within a larger network and to categorize respective nodes of a network into one of the multiple sub-structures. Herein, such sub-structures within a network may be referred to as modules. Accordingly, a module may generally be understood to include a group of nodes in a network. For example, many networks include groups of nodes that are more densely connected internally than with the rest of a network.

While categorizing nodes into modules may be a good starting point, many complex networks contain relationships that are deeper than two levels. For instance, biological and social systems are often characterized by a hierarchical organization having multiple levels of submodules that are nested within modules. Network theory offers tools for identifying nested modules within a network of interconnected nodes, but current techniques for automatically ranking and mapping network structures tend to be inefficient, incomplete, inaccurate, time-consuming, and dependent on direct human mediation.

Improvements are therefore desired.

SUMMARY

Described herein are methods and systems for hierarchically mapping, ranking, and labeling data sets automatically. Also provided are methods for browsing and navigating a hierarchically mapped data set, and identifying changes in network structure over time.

In one example aspect, a computer-implemented method is provided. The method may involve receiving document data indicating (i) a corpus of documents and (ii) references between documents within the corpus of documents. The method may further include determining a network including (i) two or more nodes and (ii) at least one directed edge. Each node may correspond to a respective document in the corpus of documents and each directed edge may connect two respective nodes and correspond to a reference between two documents in the corpus of documents. Also, the method may include determining a hierarchical partition of the documents based on the directed edges of the network. The hierarchical partition may define a plurality of nested modules, and each module in the plurality of nested modules may be associated with one or more respective documents within the corpus of documents. The method may additionally include causing a graphical display to provide a visual indication of one or more of the plurality of nested modules.

In another example aspect, a system including at least one processor, a physical computer readable medium, and program instructions stored on the physical computer readable medium is provided. The instructions may be executable by the at least one processor to receive document data indicating (i) a corpus of documents and (ii) references between documents within the corpus of documents. The instructions may be further executable to determine a network including (i) two or more nodes and (ii) at least one directed edge. Each node may correspond to a respective document in the corpus of documents, and each directed edge may connect two respective nodes and correspond to a reference between two documents in the corpus of documents. Additionally, the instructions may be executable to determine a hierarchical partition of the documents based on the directed edges of the network. The hierarchical partition may define a plurality of nested modules, and each module in the plurality of nested modules may be associated with one or more respective documents within the corpus of documents.

In another example aspect, a physical computer readable medium having instructions stored thereon is provided. The instructions may include receiving document data indicating (i) a corpus of documents and (ii) references between documents within the corpus of documents. The instructions may further include determining a network including (i) two or more nodes and (ii) at least one directed edge. Each node may correspond to a respective document in the corpus of documents and each directed edge may connect two respective nodes and correspond to a reference between two documents in the corpus of documents. Also, the instructions may include determining a hierarchical partition of the documents based on the directed edges of the network. The hierarchical partition may define a plurality of nested modules, and each module in the plurality of nested modules may be associated with one or more respective documents within the corpus of documents. The instructions may additionally include causing a graphical display to provide a visual indication of one or more of the plurality of nested modules.

In a further aspect, a computer-implemented method is provided. The method may involve receiving partition data indicating a first hierarchical partition of a first corpus of documents associated with a first time period and receiving partition data indicating a second hierarchical partition of a second corpus of documents associated with a second time period. The first hierarchical partition may define a first plurality of nested modules, and each module in the first plurality of nested modules may be associated with one or more respective documents within the first corpus of documents. Similarly, the second hierarchical partition may define a second plurality of nested modules, and each module in the second plurality of nested modules may be associated with one or more respective documents within the second corpus of documents. The method may also include comparing (i) a difference between (a) a number of references to documents within a particular module of the first plurality of nested modules and (b) a number of references to documents within a corresponding module of the second plurality of nested modules to (ii) a threshold. Additionally, the method may include causing a graphical display to provide a visual indication of the difference and the particular module based on the comparison.

In a further aspect, a system including at least one processor, a physical computer readable medium, and program instructions stored on the physical computer readable medium is provided. The instructions may be executable by the at least one processor to receive partition data indicating a first hierarchical partition of a first corpus of documents associated with a first time period, and receive partition data indicating a second hierarchical partition of a second corpus of documents associated with a second time period. The first hierarchical partition may define a first plurality of nested modules, and each module in the first plurality of nested modules may be associated with one or more respective documents within the first corpus of documents. Similarly, the second hierarchical partition may define a second plurality of nested modules, and each module in the second plurality of nested modules may be associated with one or more respective documents within the second corpus of documents. The instructions may also be executable to compare (i) a difference between (a) a number of references to documents within a particular module of the first plurality of nested modules and (b) a number of references to documents within a corresponding module of the second plurality of nested modules to (ii) a threshold. Additionally, the instructions may be executable to cause a graphical display to provide a visual indication of the difference and the particular module based on the comparison.

In another example aspect, a physical computer readable medium having instructions stored thereon is provided. The instructions may include receiving partition data indicating a first hierarchical partition of a first corpus of documents associated with a first time period, and receiving partition data indicating a second hierarchical partition of a second corpus of documents associated with a second time period. The first hierarchical partition may define a first plurality of nested modules, and each module in the first plurality of nested modules may be associated with one or more respective documents within the first corpus of documents. Similarly, the second hierarchical partition may define a second plurality of nested modules, and each module in the second plurality of nested modules may be associated with one or more respective documents within the second corpus of documents. The instructions may also include comparing (i) a difference between (a) a number of references to documents within a particular module of the first plurality of nested modules and (b) a number of references to documents within a corresponding module of the second plurality of nested modules to (ii) a threshold. Additionally, the instructions may include causing a graphical display to provide a visual indication of the difference and the particular module based on the comparison.

Aspects of the embodiments disclosed herein can be combined with other embodiments or combinations of embodiments unless the context clearly dictates otherwise.

These as well as other aspects, advantages, and alternatives will become apparent to those of ordinary skill in the art by reading the following detailed description, with reference where appropriate to the accompanying drawings.

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 is a flow chart of an example method for determining a hierarchical partition of a corpus of documents.

FIG. 2 is a conceptual illustration of an example corpus of documents.

FIG. 3 is a conceptual illustration of an example network.

FIG. 4 is a conceptual illustration of a hierarchical cluster.

FIGS. 5A and 5B are conceptual illustrations of a multi-scale map.

FIG. 6 is a conceptual illustration of a search prompt.

FIGS. 7-9 show example search results.

FIG. 10 is a flow chart of an example method for identifying a change in network structure over time.

FIG. 11 is a flow diagram of an example approach for determining a level of change in network structure over time.

FIG. 12 is a conceptual illustration of an alluvial diagram.

FIG. 13 is a conceptual illustration of an example patent report.

FIG. 14 is a simplified block diagram of an example computing device.

FIG. 15 is a schematic illustrating a conceptual partial view of an example computer program product that includes a computer program for executing a computer process on a computing device.

DETAILED DESCRIPTION

In the following detailed description, reference is made to the accompanying figures, which form a part thereof. In the figures, similar symbols typically identify similar components, unless context dictates otherwise. The illustrative embodiments described in the detailed description, figures, and claims are not meant to be limiting. Other embodiments may be utilized, and other changes may be made, without departing from the spirit or scope of the subject matter presented herein. It will be readily understood that aspects of the present disclosure, as generally described herein, and illustrated in the figures, can be arranged, substituted, combined, separated, and/or designed in a wide variety of different configurations, all of which are explicitly contemplated herein.

Further, for purposes of illustration, certain aspects of the disclosure herein will be described with respect to particular types of documents, such as patent documents and scholarly publications. It should be understood, however, that part or all of the described systems and methods may apply equally to other types of documents. Therefore, the described embodiments should not be taken to be limiting.

I. FIRST EXAMPLE METHOD

Described herein are methods and systems for hierarchically classifying data sets. The methods and systems may facilitate hierarchically classifying, ranking, and labeling data sets. Also described are methods for navigating hierarchically classified data sets, and identifying changes in a network structure of a data set (e.g., as new data is added to the data set over time, as existing data is removed from the data set over time).

In some examples, the methods and systems described herein may be used to automatically classify a corpus of documents that can be characterized as a time-directed network. In a time-directed network of documents, documents generally reference or cite temporal predecessors (e.g., each respective document may only reference documents that were published prior to the document). Examples of groups of documents that form time-directed networks include scholarly documents, patent documents, or litigation documents, whether published or not published, among other documents. Other aspects, uses, and advantages of the methods and systems described herein are described further below.

Additionally, although examples herein are described with respect to classifying a corpus of documents, the methods and systems may also be applicable to classifying other types of data sets. For example, the systems and methods may similarly be used to hierarchically classify a data set that includes information identifying a plurality of social media users and communications between the social media users. Other examples may exist.

With reference to FIG. 1, an example method 100 for determining a hierarchical partition of a corpus of documents is described. As shown in FIG. 1, initially at block 102, the method 100 includes receiving document data indicating a corpus of documents and references between documents within the corpus of documents. At block 104, the method 100 includes determining a network including two or more nodes and at least one directed edge. Each node may correspond to a respective document in the corpus of documents and each directed edge may correspond to a reference between documents within the corpus of documents. At block 106, the method 100 includes determining a hierarchical partition of the documents based on the directed edges of the network. The hierarchical partition may define a plurality of nested modules, and each module in the plurality of nested modules may be associated with one or more respective documents within the corpus of documents. At block 108, the method 100 includes causing a graphical display to provide a visual indication of one or more of the plurality of nested modules. These steps are more fully explained in the following subsections, with reference to FIGS. 2-4.

Generally, the methods and functions described herein may be carried out by a computing device, such as computing device 1402 of FIG. 14 described further below. Again, however, it should be understood that the computing device 1402 of FIG. 14 is set forth for purposes of example and explanation only, and should not be taken to be limiting. The present methods and functions may just as well be carried out in other systems having other arrangements.

a. Receive Document Data for a Corpus of Documents

At block 102, the method 100 includes receiving document data indicating (i) a corpus of documents and (ii) references between documents within the corpus of documents. In some instances, the document data may be stored in a digital archive or other database and transmitted to a computing device that is configured to perform the method 100. For instance, the computing device configured to perform the method 100 may access the document data from a database, or another computing device may provide the document data. The document data may indicate information associated with any type of rendered documents, such as electronic and/or printed documents. For example, the document data may identify each respective document of the corpus of documents. For each respective document, the document data may also include information such as a title of the document, date, page count, author(s), a list of other documents referencing the document, and a list of other documents referenced by the document, among other possible types of information. In one instance, the document data may be provided in the form of a tab-delimited table.
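As an illustration of receiving document data in tab-delimited form, the following is a minimal sketch rather than a prescribed format. The column names (`doc_id`, `title`, `date`, `cites`), the semicolon separator inside the `cites` column, and the function name `load_document_data` are all assumptions made for this example.

```python
import csv
import io

# Hypothetical column layout: the description only states that the document
# data may be a tab-delimited table and lists the kinds of fields it may hold.
SAMPLE = (
    "doc_id\ttitle\tdate\tcites\n"
    "D1\tFirst paper\t2010-01-05\t\n"
    "D2\tSecond paper\t2011-03-17\tD1\n"
    "D3\tThird paper\t2012-09-30\tD1;D2\n"
)


def load_document_data(text):
    """Return (documents, references) parsed from a tab-delimited table.

    documents maps each document id to its metadata; references is a list of
    (citing_id, cited_id) pairs.
    """
    documents = {}
    references = []
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    for row in reader:
        doc_id = row["doc_id"]
        documents[doc_id] = {"title": row["title"], "date": row["date"]}
        # An empty "cites" field means the document references nothing.
        cited = [c for c in (row["cites"] or "").split(";") if c]
        references.extend((doc_id, target) for target in cited)
    return documents, references


documents, references = load_document_data(SAMPLE)
```

Here `documents` identifies each document of the corpus, and `references` holds one ordered pair per reference, which is the information later blocks of the method rely on.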

In one example, the corpus of documents may include individual scholarly papers or articles. In such an example, the references between documents may be citations from one scholarly article to one or more other scholarly articles.

In another example, the corpus of documents may include patent documents. For instance, the patent documents may include one or more of U.S. patents and patent applications, international patents and patent applications, or foreign patents and patent applications, petty patents, design patents, plant patents, utility models, as well as any other patent-related or patent application-related document. In such an example, a reference may be made or established from one document to another based on a citation in one patent document that references another patent document, an Office action (e.g., a document written by an examiner in a patent examination procedure that is mailed to an applicant for a patent) that is received in an application for a patent document and identifies one or more other patent documents, an information disclosure statement filed in an application for a patent document that references one or more other patent documents, or other similar citations or associations between patent documents.

In a further example, documents within the corpus of documents may correspond to individual inventors. For instance, for each inventor of each patent document, a document corresponding to the inventor and the patent document may be established. As an example, for a patent that lists inventors A and B, the patent may be represented by two documents in a corpus of documents. Each of the two documents may correspond to one of the respective inventors. Similarly, documents within the corpus of documents may correspond to individual patent firms that prepared a patent document. For instance, a corpus of documents may be formed where each document in the corpus corresponds to a document prepared by a particular patent firm.

In still another example, the corpus of documents may include litigation-related documents, such as court case summaries, court case rulings, and/or pre-trial documents. In such an example, the references between documents may include citations from one court case document to one or more other court case documents.

In still other examples, the corpus of documents may include webpages, contracts, licensing documents, government files, regulations (e.g., SEC filings, FDA applications or warning letters, EPA notifications), newspapers, magazine articles, social media pages, emails, Twitter feeds, and books. In one instance, documents in the corpus of documents may be genes and/or proteins. For example, genes may reference transcription factors responsible for their expression.

Generally, documents in the corpus of documents may be any type of documents that directly or indirectly reference one or more other documents. The documents of the corpus of documents need not be written documents. In some examples, the documents may include audio documents, video documents, or image documents. In an example in which documents include audio documents, a reference between documents may be determined based on common audio sequences existing within two documents. In an example in which documents include image documents, similar visual cues may serve as references between image documents. Other examples may exist, and the described examples are not meant to be limiting.

Additionally, in one example, a reference between two documents may be established based on usage data. For example, a system may be configured to track documents that are downloaded by a user as well as the order in which the documents are downloaded. Such traffic information may be indicative of an indirect reference between two documents. As a particular example, if a first document is downloaded, and subsequently a second document is downloaded, the flow between downloading the first document and then downloading the second document may be used to establish a reference from the first document to the second document. Thus, in some examples, the document data may indicate references between two documents that have been determined based on download traffic, or other types of flow, between the two documents.
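The download-order idea can be sketched as follows. This is only one possible reading: the log format (chronologically ordered `(user, doc_id)` pairs), the choice to link only consecutive downloads by the same user, and the function name `references_from_downloads` are assumptions for illustration.

```python
from collections import Counter


def references_from_downloads(download_log):
    """Derive weighted references from an ordered download log.

    download_log is a chronologically ordered list of (user, doc_id) pairs
    (a hypothetical format). Each time a user downloads one document and then
    a different one, a reference from the earlier to the later is counted.
    """
    last_by_user = {}
    weights = Counter()
    for user, doc in download_log:
        previous = last_by_user.get(user)
        if previous is not None and previous != doc:
            weights[(previous, doc)] += 1
        last_by_user[user] = doc
    return weights


log = [("u1", "A"), ("u2", "B"), ("u1", "B"), ("u1", "C"), ("u2", "C")]
flow_references = references_from_downloads(log)
```

The counts can be kept as edge weights or thresholded into plain references, depending on how the document data is to be consumed.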

FIG. 2 is a conceptual illustration of an example corpus of documents 200. As shown in FIG. 2, the corpus of documents 200 includes a plurality of references between two respective documents. A given document of the corpus of documents may reference any number of other documents. For instance, a given document may reference zero, one, or multiple other documents. As an example, document 202 references document 204 and document 206. Similarly, a given document may be referenced by zero, one, or multiple other documents. As an example, document 208 is referenced by document 210. In some instances, two documents (e.g., two documents published at approximately the same time) may reference each other (not shown).

In FIG. 2, an arrow from a first document to a second document represents a reference from the first document to the second document. For example, arrow 212 represents a reference from document 210 to document 208. In some examples, the first document may reference the second document within the text of the first document. In other examples, a reference from a first document to a second document may be indirectly associated with the first document. For instance, if the corpus of documents includes patent documents, the reference from the first document to the second document may be established based on an information disclosure statement that was filed during prosecution of the first document which identified the second document. Other examples of indirectly associated references are also possible.

The references between documents of the corpus of documents 200 are examples of time-directed references. In FIG. 2, documents that are above other documents represent documents that are published after the other documents. A given document of the corpus of documents 200 may reference document(s) that were published before the given document (documents below the given document in FIG. 2), but not documents that were published after the given document (documents above the given document in FIG. 2).

The configuration of the corpus of documents 200 is an example of a time-directed structure that is common for scholarly papers, patent documents, court case documents, and other types of documents, where each document directly or indirectly references one or more temporal predecessors. However, the configuration of the corpus of documents 200 shown in FIG. 2 is only an example. Other examples, with different characteristics, may also exist.

b. Determine a Network

At block 104, the method 100 includes determining a network including (i) two or more nodes and (ii) at least one directed edge. Each node of the network may correspond to a respective document in the corpus of documents. Also, each directed edge of the network may connect two respective nodes. And each such directed edge may correspond to a reference between two documents in the corpus of documents. More specifically, a directed edge from a first node to a second node may correspond to a reference from a first document, represented by the first node, to a second document, represented by the second node.

FIG. 3 is a conceptual illustration of an example network 300. The example network 300 is an example of a network that may be determined based on the corpus of documents 200 of FIG. 2 and the references between documents associated with the corpus of documents 200. The example network 300 may be determined by assigning each respective document of the corpus of documents 200 as a node of the network 300 and providing a directed edge between two nodes for each corresponding reference between two documents in the corpus of documents 200. In FIG. 3, each node of the example network is illustrated as a circle, and each directed edge is illustrated as an arrow. For example, as shown in FIG. 3, a directed edge 302 may be established between a first node 304 and a second node 306.

In one example, the received document data may be processed to allocate one node for each document in the corpus of documents. Subsequently, based on the allocation, for each reference between two documents indicated in the received document data, an appropriately located directed edge may be added to the network 300 between two nodes corresponding to the two documents.

Although the determination of the network at block 104 is described with respect to a visual depiction of the nodes and directed edges of the network, determining such a visual depiction of the network may not be necessary. For example, determining the network may involve determining a set of nodes and determining a set of ordered pairs of nodes. Each node may correspond to a document, and each ordered pair may correspond to a directed edge between a first node of the ordered pair and a second node of the ordered pair. Thus, determining the network at block 104 may generally involve determining the nodes and directed edges of the network, and storing information pertaining to the determined nodes and directed edges of the network in memory for further use in accordance with other blocks of the method 100.
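The node-set and ordered-pair representation can be sketched in a few lines. The example reuses the reference numerals of FIG. 2 as document identifiers. The function name `build_network` and the choice to drop self-references and references that point outside the corpus are assumptions of this sketch.

```python
def build_network(doc_ids, references):
    """Return (nodes, edges) for a corpus.

    nodes is a set with one entry per document; edges is a set of ordered
    pairs (source, target), one per reference between documents in the corpus.
    """
    nodes = set(doc_ids)
    edges = {
        (source, target)
        for source, target in references
        # Assumption: keep only references within the corpus, skip self-loops.
        if source in nodes and target in nodes and source != target
    }
    return nodes, edges


# Documents and references as described for FIG. 2: document 202 references
# documents 204 and 206, and document 210 references document 208.
doc_ids = ["202", "204", "206", "208", "210"]
refs = [("202", "204"), ("202", "206"), ("210", "208"), ("210", "999")]
nodes, edges = build_network(doc_ids, refs)
```

Because no layout or drawing is involved, the stored sets are all that later blocks need in order to model flow over the directed edges.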

c. Determine a Hierarchical Partition

At block 106, the method 100 includes determining a hierarchical partition of the documents based on the directed edges of the network. The hierarchical partition may define a plurality of nested modules, and each module in the plurality of nested modules may be associated with one or more respective documents within the corpus of documents. Generally, a partition may be a classification of nodes into jointly comprehensive, mutually exclusive modules. However, modules may also overlap in other examples (i.e., nodes may be in more than one module). A module may be a group of nodes. Nodes in a module may be closely connected to one another and loosely connected to other nodes in the network. A hierarchical partition may therefore define an arrangement of modules that are represented as being “above”, “below”, or “at the same level as” one another. Thus, the determined hierarchical partition may specify which documents are associated with which modules, submodules, subsubmodules, and so on.
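One way to hold a hierarchical partition in memory is a nested list, sketched below. The representation (lists for modules, strings for documents), the 1-based path tuples, and the helper names `module_paths` and `is_exclusive_and_comprehensive` are assumptions for illustration only.

```python
def module_paths(tree, prefix=()):
    """Yield (path, document) pairs from a nested-list partition.

    A list is a module whose items are submodules (lists) or documents
    (strings). The path (1, 2) means "submodule 2 of module 1".
    """
    for index, child in enumerate(tree, start=1):
        if isinstance(child, list):
            yield from module_paths(child, prefix + (index,))
        else:
            yield prefix, child


def is_exclusive_and_comprehensive(tree, documents):
    """True if every document appears in exactly one innermost module."""
    placed = [doc for _, doc in module_paths(tree)]
    return sorted(placed) == sorted(documents)


# Two top-level modules; the first holds two submodules, the second one.
partition = [[["a", "b"], ["c"]], [["d", "e"]]]
assignment = {doc: path for path, doc in module_paths(partition)}
```

The resulting `assignment` answers the question the paragraph ends on: which document belongs to which module and submodule.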

Before describing example approaches for determining a hierarchical partition, a brief overview of a flow-based and information theoretic clustering method, called the map equation, is provided. The map equation has been shown to be a fast, accurate, and effective method for finding structure in large networks. For example, minimizing the map equation over all possible modular configurations of a network may be used to help determine the optimal modular description of the network with respect to the dynamics on the network.

Following the description of the map equation, a more generalized form of the map equation is discussed. In particular, a hierarchical map equation is provided. The hierarchical map equation described below may be used, for instance, to determine: into how many hierarchical levels a given network may be organized; how many modules are present at each hierarchical level; and which nodes are members of which modules.

Next, two example techniques for determining a hierarchical partition using the hierarchical map equation are described. Each of the example techniques provides an example of determining a hierarchical partition of the documents based on the directed edges of the network in accordance with block 106.

Generally, the map equation exploits the duality between compressing data and inferring structure. Such a duality is, for example, explored in the branch of statistics known as minimum description length statistics, where the goal is to determine a hypothesis for a given data set that leads to the best compression of the data. The map equation quantifies a minimum description length for a data set and can be applied to detect two-level or multi-level community structures in a network. Specifically, the map equation uncovers community structures of nodes in a network by modeling a process of flow over the directed edges of that network.

Applying the map equation to a network can be visualized as following the path of a random walker from node to node to node as the random walker moves around on the directed edges of a network. Often, the random walker may temporarily be located in highly interconnected groups of nodes, or modules, with relatively long persistence times. To describe the structure of the network, a trace of the random walker's path may be encoded in a compressed message. In one example, the compressed message uses module codewords for transitions between modules and node codewords that can be reused between modules. To make the compressed message unambiguous, an index codebook may distinguish which module codebook is active for a given portion of the compressed message.

Ultimately, the map equation measures the average description length, that is, the average number of bits per step that are used to trace the random walker's path. Maximum compression is achieved when the modular structure of the codewords matches the modular structure of the network. The map equation is designed to take advantage of the fact that a network structure often has localized regions of small groups of nodes where a random walker will stay for long persistence times, and to reveal those localized regions. Therefore, minimizing the map equation over all possible modular configurations may be used to determine the best modular description of the network with respect to the dynamics on the network. Note that using the map equation does not require that an actual compressed message be derived for a given partition. The map equation reveals how efficient an optimal coded message would be for any given partition, without actually devising that coded message.
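To make the "average bits per step" idea concrete, the sketch below evaluates the two-level map equation in the form given in the published map-equation literature, L = q·H(Q) + Σ p_i·H(P_i). That formula is not reproduced from this description, so its use here is an assumption, as are the function names and the small test network (two triangles joined by one undirected edge, for which node visit rates are proportional to degree).

```python
import math


def entropy(weights):
    """Shannon entropy in bits of the distribution obtained by normalizing."""
    total = sum(weights)
    if total <= 0:
        return 0.0
    return -sum((w / total) * math.log2(w / total) for w in weights if w > 0)


def two_level_description_length(visit, step, modules):
    """Average bits per step for a partition of nodes into modules.

    visit maps node -> stationary visit rate; step maps node -> {neighbor:
    transition probability}; modules is a list of disjoint sets of nodes.
    """
    # Rate at which the walker exits each module on a given step.
    exits = [
        sum(visit[a] * pr for a in mod for b, pr in step[a].items() if b not in mod)
        for mod in modules
    ]
    length = sum(exits) * entropy(exits)  # index codebook term
    for mod, q_i in zip(modules, exits):
        weights = [q_i] + [visit[a] for a in mod]  # exit codeword + node codewords
        length += sum(weights) * entropy(weights)  # module codebook term
    return length


# Two triangles (a, b, c) and (d, e, f) joined by the undirected edge c-d.
links = {"a": "bc", "b": "ac", "c": "abd", "d": "cef", "e": "df", "f": "de"}
total_degree = sum(len(nbrs) for nbrs in links.values())
visit = {n: len(nbrs) / total_degree for n, nbrs in links.items()}
step = {n: {m: 1 / len(nbrs) for m in nbrs} for n, nbrs in links.items()}

one_module = two_level_description_length(visit, step, [set(links)])
two_modules = two_level_description_length(visit, step, [set("abc"), set("def")])
```

With these inputs, describing the walk with one codebook per triangle takes fewer bits per step than using a single codebook for all six nodes, which is the sense in which the partition "matches" the network. Note that no codewords are ever constructed; only the lengths are computed.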

Additionally, the determination of the modular description of the network is often dependent on a ranking algorithm that is used to determine a relative value of each node in a network. For instance, a ranking algorithm may be used to establish a stationary distribution that indicates a visitation rate at each node during a random walk. Information from the stationary distribution may then be used to determine the index codebook. Thus, ranking and hierarchically classifying may be intertwined.

Referring back to the method 100 and block 106, in one example, to ultimately determine a hierarchical partition that reveals patterns at multiple levels of a network, a process of flow on the network may be modeled and analyzed using a hierarchical map equation. While the map equation employs a single index codebook to encode a compressed message, using a separate index codebook for each level of a multi-level hierarchy makes it possible to exploit the fact that modules themselves are often organized into larger modules.

Broadly, the hierarchical map equation releases the constraint of a single index codebook and allows for an arbitrary number of hierarchically nested index codebooks that identify movement between modules, submodules, subsubmodules, etc. FIG. 4 is a conceptual illustration of a network that has been partitioned into a hierarchical cluster 400. As shown in FIG. 4, each node is a part of a submodule 402 that includes three nodes. Further, each submodule 402 is part of a module 404 that includes three submodules 402. To encode a compressed message that describes a trace of a random walker throughout the nodes of the hierarchical cluster 400, each module 404 may be associated with a respective index codebook, and each submodule 402 within each module 404 may also be associated with a respective index codebook. Note that the example hierarchical cluster 400 is provided for purposes of example and explanation, and the example is not meant to be limiting.

Formally, for a hierarchical map M of n nodes partitioned into m modules, for which each module i has a submap M^(i) with m^(i) submodules, for which each submodule ij has a submap M^(ij) with m^(ij) submodules, and so on, the hierarchical map equation takes the form of Equation 1:

$$L(M) = q_{\curvearrowleft} H(Q) + \sum_{i=1}^{m} L(M^{i}) \qquad (1)$$

with the description length of submap M^(i) at intermediate levels given by Equation 2:

$$L(M^{i}) = q_{i}^{\circlearrowright} H(Q^{i}) + \sum_{j=1}^{m^{i}} L(M^{ij}) \qquad (2)$$

and at the finest modular level by Equation 3:

$$L(M^{ij \ldots k}) = p_{ij \ldots k}^{\circlearrowright} H(P^{ij \ldots k}). \qquad (3)$$

At each submodule level, $q_{i}^{\circlearrowright}$ is the rate of codeword use for entering the $m^{i}$ submodules or exiting to a coarser level, and $H(Q^{i})$ is the frequency-weighted average length of the codewords in the subindex codebook. At the finest level, $p_{ij \ldots k}^{\circlearrowright}$ is the rate of codeword use for visiting nodes in submodule ij . . . k or exiting to a coarser level, and $H(P^{ij \ldots k})$ is the frequency-weighted average length of the codewords in the submodule codebook.

In order to determine a hierarchical structure that optimally represents the structure with respect to flow over the directed edges of the network, a hierarchical partition of the network that minimizes the hierarchical map equation over all possible hierarchical partitions of the network may be determined. Any greedy approach (fast but inaccurate) or Monte Carlo-based approach (accurate but slow) can be used to minimize the hierarchical map equation.

According to a first example approach for determining the hierarchical partition of the documents based on the directed edges of the network, a modified version of the search algorithm described in the publication entitled “The Map Equation” by M. Rosvall, D. Axelsson, and C. Bergstrom published in volume 178 of The European Physical Journal Special Topics (2009), which is herein incorporated by reference in its entirety, may be used.

The above-referenced search algorithm was designed for use with the map equation. At a high level, the algorithm proceeds as follows: neighboring nodes are joined into modules, which subsequently are joined into supermodules, and so on. First, each node may be assigned to its own module. Then, in random sequential order, each node may be moved to the neighboring module that results in the largest decrease of the map equation. If no move results in a decrease of the map equation, the node may stay in its originally assigned module. The process may be repeated, each time in a new random sequential order, until no move generates a decrease of the map equation. Additionally, the algorithm can be further improved by allowing for submodule movements and/or single-node movements.
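As a non-limiting illustration, the node-moving loop described above may be sketched in Python as follows. The sketch is not the published search algorithm. The function names `weighted_entropy`, `two_level_length`, and `greedy_moves`, the treatment of links as undirected with node visit rates proportional to link weight, and the use of a depth-one description length (one index codebook plus one codebook per module) are all assumptions made for the example.

```python
import math
import random


def weighted_entropy(rates):
    """Return total * H(rates / total) in bits; zero rates contribute nothing."""
    rates = list(rates)
    total = sum(rates)
    if total <= 0:
        return 0.0
    return -sum(r * math.log2(r / total) for r in rates if r > 0)


def two_level_length(edges, module_of):
    """Depth-one description length for an undirected, weighted network.

    edges: {(node_a, node_b): weight}; module_of: {node: module id}.
    Assumption: each undirected link carries equal flow in both directions.
    """
    total = 2.0 * sum(edges.values())
    visit, exit_flow = {}, {}
    for (a, b), w in edges.items():
        flow = w / total
        visit[a] = visit.get(a, 0.0) + flow
        visit[b] = visit.get(b, 0.0) + flow
        if module_of[a] != module_of[b]:
            for m in (module_of[a], module_of[b]):
                exit_flow[m] = exit_flow.get(m, 0.0) + flow
    # Index codebook term: movements between modules.
    length = weighted_entropy(exit_flow.values())
    # One module codebook per module: node visits plus the exit codeword.
    for m in set(module_of.values()):
        members = [visit.get(n, 0.0) for n in module_of if module_of[n] == m]
        length += weighted_entropy([exit_flow.get(m, 0.0)] + members)
    return length


def greedy_moves(edges, seed=0):
    """Move nodes between neighboring modules until no move shortens the code."""
    rng = random.Random(seed)
    nodes = sorted({n for edge in edges for n in edge})
    neighbors = {n: set() for n in nodes}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    module_of = {n: n for n in nodes}  # start with every node in its own module
    best = two_level_length(edges, module_of)
    improved = True
    while improved:
        improved = False
        order = nodes[:]
        rng.shuffle(order)  # new random sequential order on every pass
        for n in order:
            home = module_of[n]
            best_target = home
            for target in sorted({module_of[x] for x in neighbors[n]} - {home}):
                module_of[n] = target
                length = two_level_length(edges, module_of)
                if length < best - 1e-12:
                    best, best_target = length, target
            module_of[n] = best_target  # stays home if no move helped
            if best_target != home:
                improved = True
    return module_of, best
```

Because the visiting order is random, different seeds may stop at different local minima of the description length.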

For instance, to allow for submodule movements, each cluster may be treated as a network on its own and the main algorithm may be applied to the cluster. This procedure generates one or more submodules for each module. Then all submodules are moved back to their respective modules of the previous step. At this stage, with the same partition as in the previous step but with each submodule being freely moveable between the modules, the main algorithm is re-applied.

To allow for single-node movements, each node may be re-assigned to be the sole member of its own module. Then all nodes may be moved back to their respective modules of the previous step. At this stage, with the same partition as in the previous step, but with each single node being freely moveable between the modules, the main algorithm may be re-applied.

The algorithm described above may be modified to explore multi-level solutions using the hierarchical map equation. In one example, the algorithm may be modified to recursively try to add extra index codebooks both at coarser and finer levels. For example, the modified algorithm may successively increase and decrease the depth of different branches of the multilevel code structure to try to further compress a representation of the structure of the network.

In a further example modification, to reduce the small cohesive effect of random teleportation, the hierarchical map equation may be modified to only measure the description length of steps following links and exclude the steps associated with random teleportation. To exclude random teleportation steps from the description length, the ergodic node visit frequencies p_(α) for α=1 . . . n with random teleportation probability τ=0.15 may be calculated. Then, for every node α and for all its outgoing links with relative weight w_(αβ) to node β, the probability that the random walker does not teleport but rather follows a link in a given step may be calculated using Equation 4:

$$q_{\alpha \to \beta} = (1-\tau)\, p_{\alpha} w_{\alpha\beta}. \qquad (4)$$

The node visit frequencies may then be calculated using Equation 5:

$$p_{\alpha} = \sum_{\beta} q_{\beta \to \alpha}. \qquad (5)$$
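By way of a non-limiting illustration, Equations 4 and 5 may be sketched in Python as follows. The function names `link_flows` and `node_flows` and the three-node example are hypothetical. The sketch also assumes that Equation 5 sums the link flow arriving at each node, and that the relative weights of the outgoing links of each node sum to one.

```python
def link_flows(visit, weights, tau=0.15):
    """Equation 4: flow on link (a, b) is (1 - tau) * p_a * w_ab.

    visit: {node: ergodic visit frequency p_a}
    weights: {(a, b): relative weight of the link from a to b}
    """
    return {(a, b): (1.0 - tau) * visit[a] * w for (a, b), w in weights.items()}


def node_flows(flows):
    """Equation 5 (as assumed here): sum the link flow arriving at each node."""
    arriving = {}
    for (_, b), flow in flows.items():
        arriving[b] = arriving.get(b, 0.0) + flow
    return arriving


# Hypothetical three-node cycle in which every node is visited equally often.
visit = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}
weights = {("a", "b"): 1.0, ("b", "c"): 1.0, ("c", "a"): 1.0}
flows = link_flows(visit, weights)
frequencies = node_flows(flows)
```

In this example the recalculated frequencies sum to 1 − τ rather than to one, because the teleportation steps are excluded from the description length.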

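At the coarsest level, the relative rate of codeword use may be found using Equation 6:

$$Q = \left\{ \frac{q_{i\curvearrowleft}}{q_{\curvearrowleft}} \right\} = \frac{q_{1\curvearrowleft}}{q_{\curvearrowleft}}, \frac{q_{2\curvearrowleft}}{q_{\curvearrowleft}}, \ldots, \frac{q_{m\curvearrowleft}}{q_{\curvearrowleft}} \qquad (6)$$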

where the per-step average flow into the modules and the total codeword use is:

$$q_{\curvearrowleft} = \sum_{i=1}^{m} q_{i\curvearrowleft} \qquad (7)$$

The Shannon information of movements at the coarsest level is therefore:

$$H(Q) = -\sum_{i=1}^{m} \frac{q_{i\curvearrowleft}}{q_{\curvearrowleft}} \log \frac{q_{i\curvearrowleft}}{q_{\curvearrowleft}}. \qquad (8)$$

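At the intermediate level, the relative rate of codeword use is found using Equation 9:

$$Q^{i} = \frac{q_{i\curvearrowright}}{q_{i}^{\circlearrowright}}, \frac{q_{i1\curvearrowleft}}{q_{i}^{\circlearrowright}}, \ldots, \frac{q_{im^{i}\curvearrowleft}}{q_{i}^{\circlearrowright}} \qquad (9)$$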

where the total codeword use is:

$$q_{i}^{\circlearrowright} = q_{i\curvearrowright} + \sum_{j=1}^{m^{i}} q_{ij\curvearrowleft} \qquad (10)$$

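and the Shannon information of movements in the submodule is therefore:

$$H(Q^{i}) = -\frac{q_{i\curvearrowright}}{q_{i}^{\circlearrowright}} \log \frac{q_{i\curvearrowright}}{q_{i}^{\circlearrowright}} - \sum_{j=1}^{m^{i}} \frac{q_{ij\curvearrowleft}}{q_{i}^{\circlearrowright}} \log \frac{q_{ij\curvearrowleft}}{q_{i}^{\circlearrowright}}. \qquad (11)$$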

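Finally, at the finest levels, the relative rate of codeword use is:

$$P^{ij \ldots k} = \frac{q_{ij \ldots k\curvearrowright}}{p_{ij \ldots k}^{\circlearrowright}}, \left\{ \frac{p_{\alpha \in ij \ldots k}}{p_{ij \ldots k}^{\circlearrowright}} \right\} \qquad (12)$$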

where the total codeword use is:

$$p_{ij \ldots k}^{\circlearrowright} = q_{ij \ldots k\curvearrowright} + \sum_{\alpha \in ij \ldots k} p_{\alpha} \qquad (13)$$

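and the Shannon information of movements is therefore:

$$H(P^{ij \ldots k}) = -\frac{q_{ij \ldots k\curvearrowright}}{p_{ij \ldots k}^{\circlearrowright}} \log \frac{q_{ij \ldots k\curvearrowright}}{p_{ij \ldots k}^{\circlearrowright}} - \sum_{\alpha \in ij \ldots k} \frac{p_{\alpha}}{p_{ij \ldots k}^{\circlearrowright}} \log \frac{p_{\alpha}}{p_{ij \ldots k}^{\circlearrowright}}. \qquad (14)$$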

As described above, the hierarchical map equation measures the per-step average code length used to describe a random walker's movements along directed edges between nodes of a network, for a given hierarchical partition. The modified algorithm described above is one example of a fast stochastic recursive search algorithm that may be used, at block 106, to determine the hierarchical partition.
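As a non-limiting illustration, the per-step code length of Equations 1 to 3 may be evaluated recursively once the entry flow, exit flow, and node visit frequencies of each module are known. In the Python sketch below, the nested-dictionary layout (keys `enter`, `exit`, `children`, and `nodes`), the function names, and the use of base-2 logarithms so that lengths are expressed in bits are assumptions made for the example.

```python
import math


def weighted_entropy(rates):
    """Return total * H(rates / total) in bits; zero rates contribute nothing."""
    rates = list(rates)
    total = sum(rates)
    if total <= 0:
        return 0.0
    return -sum(r * math.log2(r / total) for r in rates if r > 0)


def submap_length(module):
    """Equations 2 and 3: code length of one module and everything nested in it."""
    if "nodes" in module:
        # Finest level: exit codeword plus one codeword per node.
        return weighted_entropy([module["exit"]] + list(module["nodes"].values()))
    children = module["children"]
    # Intermediate level: exit codeword plus one entry codeword per submodule.
    own = weighted_entropy([module["exit"]] + [c["enter"] for c in children])
    return own + sum(submap_length(c) for c in children)


def hierarchical_length(top_modules):
    """Equation 1: coarsest index codebook plus the code length of all submaps."""
    index = weighted_entropy(c["enter"] for c in top_modules)
    return index + sum(submap_length(c) for c in top_modules)


# Hypothetical flows for two three-node modules joined by a single link.
left = {"enter": 1 / 14, "exit": 1 / 14, "nodes": {"a": 2 / 14, "b": 2 / 14, "c": 3 / 14}}
right = {"enter": 1 / 14, "exit": 1 / 14, "nodes": {"d": 3 / 14, "e": 2 / 14, "f": 2 / 14}}
flat = hierarchical_length([left, right])
```

The helper `weighted_entropy` returns the total codeword rate multiplied by the entropy of the normalized rates, which is the product that appears in each term of Equations 1 to 3.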

The example fast stochastic recursive search algorithm described above works by tracing long paths that move along directed edges of a network, and requires nearly ergodic flows to function effectively. In many examples, a network that is formed based on a corpus of documents that exhibits time-directedness may be far from ergodic. Rather, such a network may take the form of an acyclic directed graph or acyclic digraph.

The hierarchical map equation seeks to compress the dynamics set by the standard PageRank algorithm. (The standard PageRank algorithm is further described below.) Consequently, the hierarchical map equation suffers from the same problems as PageRank when applied to time-directed networks. For example, when PageRank is applied to a time-directed network, trails of references between documents move inexorably back in time as the directed edges of the network are followed, and a stationary distribution of a random walker is heavily time-biased. The random walker is essentially drawn inexorably back in time, visiting documents with ever-earlier publication dates as it follows the directed links of the network. Modeling the flow of a random walker on a time-directed network in the form of an acyclic digraph network using a two-mode dynamics approach overcomes the problem. The two-mode dynamics approach is another example approach that may be used to determine the hierarchical partition at block 106, and is further described below.

The standard PageRank algorithm, well known to one of ordinary skill in the art, is a one-mode dynamics approach and works by tracing a random walker on a network. The frequency at which the random walker visits different nodes gives the ranking of the nodes. Most approaches to PageRank employ a single Markov chain that includes a small teleportation probability. Teleportation, that is, random jumps between the nodes at a certain rate, is added to the path of the random walker in an effort to obtain a unique solution.

The two-mode dynamics approach to modeling the dynamics of PageRank makes it possible to efficiently map a corpus of documents into nested modules without forced long paths. The typical approach to modeling the dynamics of PageRank uses a single Markov chain to determine both the starting positions and the movements from those positions. In principle, there is no reason why a single Markov chain needs to be used to determine both the starting positions and the movements from those positions. In the two-mode dynamics approach, instead of using a single aperiodic irreducible Markov chain R to both set start frequencies and to determine moves from each node, one aperiodic irreducible Markov chain S is used to set starting frequencies for each node in an initialization step, and a second Markov chain T covering the same domain is used to specify the moves from each node in a ranking step. In particular, in the two-mode dynamics approach, teleportation involves teleporting to links and splitting the flow equally between upstream and downstream nodes along the directed links to determine starting positions, and following links downstream in the subsequent moves from each node.
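One possible reading of the initialization step and the ranking step is sketched below in Python, by way of a non-limiting illustration. The function names `start_frequencies` and `ranking_step`, the choice of a link in proportion to its weight, the convention that a link points from a citing document to a cited document, and the handling of nodes without outgoing links (their starting flow is simply not moved) are all assumptions made for the example.

```python
def start_frequencies(edges):
    """Initialization step: teleport to a link, then split its flow equally
    between the upstream node and the downstream node.

    edges: {(citing, cited): weight}. Assumption: a link is chosen in
    proportion to its weight.
    """
    total = float(sum(edges.values()))
    start = {}
    for (src, dst), w in edges.items():
        start[src] = start.get(src, 0.0) + 0.5 * w / total
        start[dst] = start.get(dst, 0.0) + 0.5 * w / total
    return start


def ranking_step(edges, start):
    """Ranking step: move the starting flow one step downstream along the links."""
    out_weight = {}
    for (src, _), w in edges.items():
        out_weight[src] = out_weight.get(src, 0.0) + w
    arrived = {node: 0.0 for node in start}
    for (src, dst), w in edges.items():
        arrived[dst] += start[src] * w / out_weight[src]
    return arrived


# Hypothetical citation chain: document c cites b, and b cites a.
chain = {("c", "b"): 1.0, ("b", "a"): 1.0}
start = start_frequencies(chain)
ranked = ranking_step(chain, start)
```

In this example only a single step is taken from each starting position, so the flow is not carried all the way back to the oldest document.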

With respect to determining a hierarchical partition, the two-mode dynamics approach amounts to looking at the encoding problem one step at a time, and only minimizing the expected description length of a single step of a random walk. For mapping the hierarchy of a network, the two-mode dynamics approach can be thought of as following an infinitely long Markov chain but ignoring the teleportation steps that guarantee ergodicity and only encoding the important ranking steps. Advantageously, only encoding the important ranking steps of the two-mode dynamics approach reveals important structures and efficiently avoids exaggerated time-biased ranking.

While multiple example approaches for determining the hierarchical partition at block 106 have been described, other examples are also possible. The described algorithms and approaches for determining the hierarchical partition are not meant to be limiting.

d. Cause a Graphical Display to Provide a Visual Indication

At block 108, the method 100 includes causing a graphical display to provide a visual indication of one or more of the plurality of nested modules. The graphical display may be any type of display device, such as a display device that is communicatively coupled to a computing system that is configured to perform the method 100. In one example, a list of modules at a given hierarchical level may be caused to be displayed. In another example, an indication of one or more of the plurality of nested modules may be provided in the form of a multi-scale map that is part of a hierarchical browser, as further described below. Other examples are also possible.

II. ADDITIONAL FUNCTIONS AND EXAMPLE APPLICATIONS

The example method 100 may include additional blocks. For example, the method 100 or portions of the method 100 may be combined with other blocks to perform any of the additional functions described below, or may be combined with any of the other methods described below.

a. Ranking

In some examples, a rank value associated with each respective document of a corpus of documents may be determined. The rank value may be an objectively determined metric associated with each respective document. In one example, the rank value associated with each respective document is determined based at least in part on a respective number of references to the respective document and the rank value associated with each of one or more documents referring to the respective document.

The ranking algorithm described below, which may be referred to as the Eigenfactor algorithm, provides an example methodology for determining which nodes in a network may be the most important or influential by determining a rank value for each node. If each node corresponds to a document in a corpus of documents, the rank value of a node may correspond to a rank value of a respective document within the corpus of documents.

When applied to a network of connected nodes, the ranking algorithm determines a modified form of the eigenvector centrality of each node in the network. The intuition behind eigenvector centrality is that important nodes are those which are linked to by other important nodes. The ranking algorithm models a random walk on a network. For example, the network may be similar to the example network 300 of FIG. 3 where each node corresponds to a document in a corpus of documents and directed edges between the nodes are determined based on references between documents in the corpus of documents. The random walk may be described by the column-stochastic form of a cross-citation matrix R. The cross-citation matrix R may be a square matrix having dimensions n×n, where n corresponds to the number of nodes in the network and the value of R_(ij) for the cross-citation matrix is 1 if there is a directed edge from node j to node i, and otherwise is 0. The cross-citation matrix R may be normalized by the column sums of R to determine a column-stochastic matrix H.

Given the matrix H, a new stochastic matrix P corresponding to a Markov process may be defined by Equation 15:

$$P = \tau H + (1-\tau)\, a\, e^{T} \qquad (15)$$

where a is a column vector such that a_(i)=1/(total number of documents), and e^(T) is a row vector of 1's. According to Equation 15, the matrix P corresponds to a Markov process, which with probability τ follows a random walk on the network and with probability (1−τ) teleports to a random node.

The leading eigenvector π of the matrix P may then be solved for. Given the leading eigenvector π, a vector of rank values V for each node may be determined using Equation 16:

$$V = 100\, \frac{H\pi}{\sum_{i} [H\pi]_{i}}. \qquad (16)$$
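By way of a non-limiting illustration, Equations 15 and 16 may be sketched in Python as follows. The function name `rank_values`, the use of power iteration to approximate the leading eigenvector, the value τ=0.85, and the replacement of any all-zero column of R (a document that references nothing) by a uniform column are assumptions made for the example.

```python
def rank_values(nodes, edges, tau=0.85, iterations=200):
    """Return {node: rank value}, with the rank values summing to 100.

    nodes: list of node names; edges: iterable of (citing, cited) pairs.
    """
    n = len(nodes)
    index = {name: i for i, name in enumerate(nodes)}
    # R[i][j] = 1 if there is a directed edge from node j to node i.
    R = [[0.0] * n for _ in range(n)]
    for src, dst in edges:
        R[index[dst]][index[src]] = 1.0
    col_sums = [sum(R[i][j] for i in range(n)) for j in range(n)]
    # Normalize by column sums; assumption: empty columns become uniform.
    H = [[R[i][j] / col_sums[j] if col_sums[j] else 1.0 / n for j in range(n)]
         for i in range(n)]

    def times_h(vector):
        return [sum(H[i][j] * vector[j] for j in range(n)) for i in range(n)]

    # Power iteration on P = tau * H + (1 - tau) * a * e^T, with a_i = 1 / n.
    # Because pi sums to one, (a * e^T) pi is simply the vector a.
    pi = [1.0 / n] * n
    for _ in range(iterations):
        h_pi = times_h(pi)
        pi = [tau * h_pi[i] + (1.0 - tau) / n for i in range(n)]

    h_pi = times_h(pi)  # Equation 16: V = 100 * H pi / sum(H pi)
    total = sum(h_pi)
    return {name: 100.0 * h_pi[index[name]] / total for name in nodes}


# Hypothetical corpus in which documents b, c, and d all reference document a.
values = rank_values(["a", "b", "c", "d"], [("b", "a"), ("c", "a"), ("d", "a")])
```

In this example document a, which is referenced by every other document, receives the largest rank value.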

Thus, the ranking algorithm may be used to determine a rank value associated with each node and corresponding document of a corpus of documents. Advantageously, the rank values determined using the ranking algorithm are additive metrics. To determine the total rank value for a group of nodes, the rank value for each respective node may be summed. Such a summation may be useful for determining the total rank value of a patent portfolio assigned to a given company, for example.

In other examples, other ranking algorithms may also be used. For instance, in one example, the described ranking algorithm may be extended using a two-mode dynamics approach that is similar to the two-mode dynamics approach described above. Instead of using a single aperiodic irreducible Markov chain to both set starting frequencies and to determine moves from each node, one aperiodic irreducible Markov chain may be used to set starting frequencies for each node in an initialization step, and a second Markov chain covering the same domain may be used to specify the moves from each node in a ranking step.

Note that the described ranking algorithms are examples of ranking algorithms that may be used to determine rank values associated with each document. Other ranking algorithms and approaches may also exist.

b. Labeling

Hierarchically classifying a large corpus of documents into a plurality of nested modules may reveal a large number of modules, submodules, subsubmodules, etc. Accordingly, it may be useful to turn to semantic labeling approaches to automatically assign names to modules, submodules, subsubmodules, etc. For instance, a label for a respective module (e.g., a module, a submodule, a subsubmodule, etc.) may be determined based on content of documents associated with the module.

In one example, an automated text-mining system may determine informative (e.g., mutual-information maximizing) bigrams, trigrams, or n-grams associated with a respective module based on content of documents associated with the module. The determined bigram, trigram, or n-gram may then be assigned as the label for the module. In a further example, a rank value associated with each respective document of the module may be used to weight any detected bigrams, trigrams, or n-grams associated with the module.
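As a non-limiting illustration, weighting detected bigrams by rank value may be sketched in Python as follows. The sketch substitutes a simple rank-weighted document frequency for the mutual-information scoring mentioned above. The function name `label_module`, the tokenization into lowercase alphabetic words, and the alphabetical tie-breaking are assumptions made for the example.

```python
import re
from collections import Counter


def label_module(documents, ranks):
    """Pick the bigram with the largest rank-weighted document frequency.

    documents: {doc_id: text of the document}
    ranks: {doc_id: rank value of the document}
    """
    scores = Counter()
    for doc_id, text in documents.items():
        words = re.findall(r"[a-z]+", text.lower())
        # Count each bigram once per document, weighted by the document's rank.
        for bigram in set(zip(words, words[1:])):
            scores[" ".join(bigram)] += ranks[doc_id]
    if not scores:
        return None
    # Highest score first; ties are broken alphabetically.
    return min(scores, key=lambda b: (-scores[b], b))


# Hypothetical module containing three short documents.
documents = {
    "d1": "Power supply circuit",
    "d2": "Switching power supply",
    "d3": "DC motor",
}
ranks = {"d1": 2.0, "d2": 1.0, "d3": 2.5}
label = label_module(documents, ranks)
```

In this example the bigram shared by two documents outscores a bigram that appears only in the single highest-ranked document.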

In another example, a rank value associated with each respective document may be used to determine a sample group of documents that is representative of the module. For example, the top n documents associated with the module, as determined by the highest rank value, may be selected as a sample group. An automated text-mining system may then determine informative bigrams, trigrams, or n-grams associated with the sample group based on the mutual content within documents of the sample group. A detected bigram, trigram, or n-gram associated with the sample group may then be assigned as a label for the module.

In yet another example, labels can be assigned by human operators to reflect the type of documents within a determined module. The described examples for labeling are not meant to be limiting, and other examples are also possible.

In one example, determined labels may be used to describe natural and objective clusters of documents that are identified using the methods and systems described herein. Rather than arbitrary categories, such as art classes typically used by patent offices, for example, the methods and systems described herein can identify a module of patent documents that have more affinity and similarity with one another than with patent documents outside the module.

c. Navigating

Automatically classifying as well as ranking and/or labeling a corpus of documents may facilitate new mechanisms for navigating the corpus of documents. In one example, a plurality of nested modules that have been determined by a hierarchical partition may be displayed in a multi-scale map. Individual modules may be labeled so that a user may hierarchically browse through different modules to reveal submodules and/or documents associated with the individual modules.

FIGS. 5A and 5B are conceptual illustrations of a multi-scale map. The conceptual illustrations shown in FIGS. 5A and 5B are example screenshots of a hierarchical browser that may be used for navigating a set of hierarchically classified patent documents. Although FIGS. 5A and 5B are described with respect to patent documents, the examples are not meant to be limiting, and may equally apply to other types of documents.

As shown in FIG. 5A, modules are represented hierarchically in one of three vertical columns. In a left column, a top-level module 502, labeled as “Patents”, represents a module that each document of a corpus of patent documents is associated with. For instance, each patent document may be broadly categorized as a patent.

In a middle column of FIG. 5A, a plurality of submodules 504 are shown. The plurality of submodules 504 represent second-level modules. Each patent document may be associated with one of the plurality of submodules 504. Additionally, in the right column of FIG. 5A, a plurality of subsubmodules 506 are shown. Each of the plurality of subsubmodules 506 is a submodule of one of the plurality of submodules 504. Each patent document may therefore also be associated with one of the plurality of subsubmodules 506.

As an example, one or more patent documents may be associated with a “processing system” submodule. The one or more patent documents of the “processing system” submodule may also be associated with a “communication system” submodule. Similarly, a “map display” submodule and a “navigation system” submodule may be submodules of the “communication system” submodule.

In FIG. 5A, a size of each module (e.g., module, submodule, subsubmodule, etc.) is proportional to a total respective rank value of documents within the module. For instance, the size of the “communication system” submodule is larger than the “apparatus implanting” submodule because the total rank value of documents within the “communication system” submodule may be greater than the total rank value of documents within the “apparatus implanting” submodule.

Also shown in FIG. 5A is an ordered list 508 of patent documents for the module shown in the left column. For example, the ordered list 508 shows the highest ranked patent documents among all patent documents associated with the “Patents” module. In one instance, the ordered list 508 may be determined based on rank values associated with each patent document.
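By way of a non-limiting illustration, the total rank value used to size a module and the ordered list of its highest ranked documents may be computed as sketched below in Python. The nested-dictionary layout (keys `label`, `documents`, and `submodules`), the function names, and the document identifiers and rank values are hypothetical.

```python
def all_documents(module):
    """Collect the documents of a module and of every nested submodule."""
    docs = list(module.get("documents", []))
    for sub in module.get("submodules", []):
        docs.extend(all_documents(sub))
    return docs


def total_rank(module, ranks):
    """Sum the rank values of all documents in the module (rank values are additive)."""
    own = sum(ranks[d] for d in module.get("documents", []))
    return own + sum(total_rank(sub, ranks) for sub in module.get("submodules", []))


def ordered_list(module, ranks, limit=10):
    """Return up to `limit` documents of the module, highest rank value first."""
    return sorted(all_documents(module), key=lambda d: ranks[d], reverse=True)[:limit]


# Hypothetical hierarchy and rank values.
patents = {
    "label": "Patents",
    "submodules": [
        {"label": "power supply", "documents": ["p1", "p2"]},
        {"label": "communication system", "documents": ["p3"],
         "submodules": [{"label": "map display", "documents": ["p4"]}]},
    ],
}
ranks = {"p1": 1.0, "p2": 2.0, "p3": 4.0, "p4": 0.5}
sizes = {sub["label"]: total_rank(sub, ranks) for sub in patents["submodules"]}
top_two = ordered_list(patents, ranks, limit=2)
```

A display routine could then scale the drawn size of each module in proportion to the corresponding entry in `sizes`.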

In one example, a computing system that implements a hierarchical browser may be configured to receive document-selection data indicating a selection of a particular module, and cause the multi-scale map to display an indication of one or more submodules within the particular module. Document-selection data, as described herein, may generally refer to any selection of a document, group of documents, module, group of modules, category, group of categories, or other information, that is provided via an input device. For instance, the input device may be a keyboard, mouse, touchpad, touchscreen, microphone, or other type of input device.

As one example of document-selection data indicating a selection of a particular module, a user may click on, or otherwise select, the “power supply” submodule. In response to receiving the document-selection data, the multi-scale map may be modified as illustrated in the screenshot of FIG. 5B. As shown in FIG. 5B, the “power supply” submodule is displayed in a left column. Additionally, submodules of the “power supply” submodule are displayed in a middle column, including a “power factor” and a “dc motor” submodule. Further, nested submodules beneath the “power factor” and “dc motor” submodules are also displayed in a right column. For example, the “induction heating” and “current sharing” modules are shown as submodules of the “power factor” submodule.

In one example, the hierarchical browser may further allow a user to “drill-down” to a further hierarchical sub-level by selecting, for example, one of the “power supply”, “dc motor”, “induction heating”, “current sharing”, “brushless dc”, or “induction motor” submodules. Similarly, the hierarchical browser may also be configured to allow a user to return back up to a previous hierarchical level by selecting an “up” option.

Additionally, in response to receiving document-selection data indicating a selection of the “power supply” submodule, an ordered list 510 of patent documents within the “power supply” submodule may be determined and displayed. The ordered list 510 of patent documents may be determined and displayed based on the rank value associated with patent documents associated with the “power supply” submodule.

The hierarchical browser may also allow a user to select a particular patent document within the ordered list 508 or the ordered list 510 to reveal more information about a particular document. For instance, in response to receiving document-selection data indicating a particular patent document, a visual indication of additional information about the particular document may be provided. The visual indication may include, for instance, an ID, year, inventor(s), listed category (e.g., group art unit determined by the U.S. Patent Office), category/module in which the patent document is hierarchically classified, and rank value, among other possible information. In some instances, the visual indication may also include a list of other patent documents indirectly or directly referencing the patent document and/or other patent documents indirectly or directly referenced by the patent document.

d. Searching

In some instances, a search engine that allows a user to search through a hierarchically classified data set may be provided. FIG. 6 is a conceptual illustration of a search prompt. The search engine may be configured to receive document-selection data that is provided via the search prompt. In one example, the document-selection data may be text entered into a search box 602 via a keyboard or keypad. In another example, the document-selection data may be text that has been converted from speech. The search engine may allow a user to search for particular patent documents by ID number (e.g., patent number or patent publication number), classified category/module, listed category (e.g., U.S. Patent Office group art unit), and/or keyword, among other possible search criteria.

In one example, the search engine may receive document-selection data indicating a particular category. For instance, a user may provide one or more keywords associated with the particular category. FIG. 7 illustrates example search results provided in response to receiving document-selection data indicating a selection of “virtual reality”. As shown in FIG. 7, a first patent document titled “Method for virtual reality . . . ” and a second patent document “Virtual reality system . . . ” are provided, as well as other patent documents. The patent documents provided in FIG. 7 are shown ordered based on rank value. In other instances, the patent documents in the search results may be ordered by other criteria. Additionally, a user may be able to select a default ordering-criterion prior to searching, or may be able to re-order search results by selecting a new ordering-criterion after the search results have been provided.

In another example, the search engine may receive document-selection data indicating a particular patent document. For instance, FIG. 8A illustrates example search results provided in response to receiving document-selection data indicating a selection of patent document “x,123,xxx”. As shown in FIG. 8A, various information identifying the patent document “x,123,xxx” is provided.

In one example, a hierarchically determined category/module (e.g., a lowest-level module determined by hierarchical classification) and a listed category (e.g., a group art unit determined by the U.S. Patent Office) may be shown. Color-coding may also be used to distinguish whether or not the patent document is listed in the same category as other patent documents of the same lowest-level module in which the patent document was classified. As an example, if the listed category of the patent document agrees with the listed category of the majority of other patent documents that are classified into the same module as the patent document, the listed category may be highlighted using a first color (e.g., green). If the listed category of the patent document is different from the listed category of the majority of other patent documents that are classified into the same module as the patent document, the listed category may be highlighted using a second, different color (e.g., red) to indicate the discrepancy. Such color-coding may be used to provide an indication of whether a natural and objective module of patent documents identified by automatic hierarchical classification provides a better grouping of patent documents than a predetermined art class, for instance.
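As a non-limiting illustration, the color-coding decision described above may be sketched in Python as follows. The function name `highlight_color`, the requirement of a strict majority, the return value of `None` when no listed category holds a majority, and the example identifiers are assumptions made for the example.

```python
from collections import Counter


def highlight_color(doc_id, module_docs, listed_category):
    """Return "green", "red", or None for the listed category of doc_id.

    module_docs: documents classified into the same lowest-level module.
    listed_category: {doc_id: listed category, e.g., a group art unit}.
    """
    others = [listed_category[d] for d in module_docs if d != doc_id]
    if not others:
        return None
    majority, count = Counter(others).most_common(1)[0]
    # Assumption: no highlight when no category covers more than half.
    if 2 * count <= len(others):
        return None
    return "green" if listed_category[doc_id] == majority else "red"


# Hypothetical module of four patent documents.
module_docs = ["p1", "p2", "p3", "p4"]
listed_category = {"p1": "A", "p2": "A", "p3": "A", "p4": "B"}
colors = {d: highlight_color(d, module_docs, listed_category) for d in module_docs}
```

In this example only the document whose listed category differs from that of the other three documents is flagged with the second color.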

In one instance, a user may further receive more information about references to and/or from the particular document. For example, FIG. 8B illustrates additional patent documents referenced by the “x,123,xxx” patent document. In one example, the patent documents referenced by the “x,123,xxx” patent document may be provided in response to a user clicking on the patent ID or the patent title. In another example, patent documents referencing the “x,123,xxx” patent document may automatically be provided in response to receiving document-selection data indicating the selection of the “x,123,xxx” patent document.

In some examples, aspects of the functionalities described with respect to FIG. 8B may be incorporated into the hierarchical browser described above with respect to FIGS. 5A and 5B. For instance, the hierarchical browser may be configured to provide additional patent documents referenced by a document in response to receiving document-selection data indicating a selection of a particular document. For instance, in response to receiving document-selection data indicating a selection of a particular document, a particular module associated with the particular document may be provided. In another instance, in response to receiving document-selection data indicating a selection of a particular patent document, one or more prior art documents associated with the particular patent document may be provided. For example, other patent documents associated with the module with which the particular patent document is associated may be displayed in order based on rank value. Further, the other patent documents may be filtered based on a date associated with the particular patent document such that only other patent documents having dates prior to the particular patent document are provided.
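By way of a non-limiting illustration, the date filter and rank ordering described above may be sketched in Python as follows. The function name `prior_art_candidates`, the use of a single date per document, and the example identifiers, dates, and rank values are hypothetical.

```python
from datetime import date


def prior_art_candidates(doc_id, module_of, dates, ranks):
    """List earlier documents from the same module, highest rank value first.

    module_of: {doc_id: module}; dates: {doc_id: date}; ranks: {doc_id: rank value}.
    """
    target_module = module_of[doc_id]
    cutoff = dates[doc_id]
    candidates = [
        d for d in module_of
        if d != doc_id and module_of[d] == target_module and dates[d] < cutoff
    ]
    return sorted(candidates, key=lambda d: ranks[d], reverse=True)


# Hypothetical corpus of five patent documents in two modules.
module_of = {"x": "m1", "a": "m1", "b": "m1", "c": "m1", "d": "m2"}
dates = {
    "x": date(2005, 1, 1), "a": date(2001, 6, 1), "b": date(2003, 3, 1),
    "c": date(2007, 9, 1), "d": date(2000, 1, 1),
}
ranks = {"x": 3.0, "a": 1.0, "b": 2.0, "c": 5.0, "d": 9.0}
candidates = prior_art_candidates("x", module_of, dates, ranks)
```

In this example document c is excluded because it is dated after the selected document, and document d is excluded because it belongs to a different module.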

In one instance, such a technique of identifying prior art may be able to identify prior art documents that may have otherwise been overlooked due to narrow or improper classifications of patent documents within a predetermined classification system (e.g., a patent office technology area). As an example, a first patent may be classified in a first art unit by a patent office, but the first patent may be classified in a given module by the hierarchical classification system described herein. The given module may include closely related patent documents that are not within the first art unit. As a result, prior art documents from the given module, that might have otherwise been overlooked because the prior art documents are located within a separate technology area within the patent office classification system, may be revealed.

In still another example, the search engine may receive document-selection data indicating a selection of a particular group of patent documents. For example, FIG. 9 illustrates example search results provided in response to receiving document-selection data indicating a selection of the “x,234,xxx”, “x,345,xxx”, and “x,456,xxx” patent documents. As shown in FIG. 9, if document-selection data indicates the selection of multiple patent documents, in some instances, the total rank value of the multiple patent documents may be provided.

e. Valuation

In an example in which a rank value is determined for each respective document of a corpus of documents, the rank value may provide an indication about the relative value (e.g., monetary, strength, influence) of a particular document. For example, if each document associated with a corpus of documents is a patent document, and an indication of a dollar value for one or more patent documents is known, the relative dollar value may be determined for additional patent documents. In one instance, a dollar value of a particular patent document may be known based on a dollar value awarded as part of a patent litigation. In another instance, a dollar value of a particular patent document may be known based on a dollar value that the patent document sold for from one party to another.

In one embodiment, a relative estimate of importance, strength, influence, and monetary value of any (or every) patent or patent application in a network may be determined. For instance, the relative estimate of one or any combination of importance, strength, influence, and monetary value may either be determined for all nodes in the network, or, alternatively, determined for any subset of nodes in a network.

As an example, to estimate one or more of each node's relative monetary value, strength, influence, or importance, each patent document in a network may be ordered based on rank value. Bounded estimates of one or any combination of monetary value, strength, influence, or importance may then be determined by associating objective measurements of monetary value, strength, influence, or importance with corresponding patent documents. Such an association may be made by associating metadata with each patent document, for example. Metadata may generally be any type of information that is associated with a particular node or a group of nodes.

The objective measurements that are associated as metadata may then be used to estimate upper and lower bounds of monetary value, strength, influence, or importance. For example, one or more amounts of monetary damages resulting from patent infringement litigations may be associated with corresponding patent documents within a network. Based on one or more amounts of monetary damages associated with particular patent documents, estimates of a maximum and/or minimum litigation value may be determined for any other patent documents having higher or lower respective rank values.

In one example, infringement damages of $8,000,000 may be associated with Patent A and infringement damages of $10,000,000 may be associated with Patent C. Also, Patent A may have a higher rank value than Patent C. If a third patent, Patent B, has a rank value intermediate between the rank values of Patent A and Patent C, an estimated litigation value of greater than $8,000,000 but less than $10,000,000 may be determined for Patent B. In another example, for a group of patents, Patents A, B, C, D, E, F, and G, arranged in rank value from highest to lowest, Patents A, B, C, and D may have been held to be valid in patent infringement litigation while Patents E, F, and G may have been held to be invalid. If another patent, Patent X, has a rank value that is intermediate between rank values of Patent B and Patent C, an estimate may be made that Patent X is more likely to be held valid than invalid. Similarly, if Patent Y has a rank value between rank values of Patent F and Patent G, a determination may be made that Patent Y is more likely to be held to be invalid than valid. Thus, associating objective measurements with patent documents located within a network can reveal useful and predictive patterns with respect to the objective measurements or even other data.
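
As a non-limiting illustration, the bounded estimate described above might be sketched in Python as follows. The function name bound_estimate, the (rank value, measured value) pair layout, and the numeric rank values used in the usage note are hypothetical and are chosen only for illustration; the sketch simply takes the nearest known document on either side in rank order and uses the two associated measurements as bounds.

```python
def bound_estimate(rank, known):
    """Bound a document's value using its nearest-ranked known neighbors.

    `known` is a list of (rank_value, measured_value) pairs, e.g. litigation
    damages associated as metadata. Returns (lower, upper), or None when the
    document is not bracketed by known documents on both sides.
    """
    below = [(r, v) for r, v in known if r < rank]
    above = [(r, v) for r, v in known if r > rank]
    if not below or not above:
        return None
    _, value_below = max(below)  # known document with the closest lower rank
    _, value_above = min(above)  # known document with the closest higher rank
    return (min(value_below, value_above), max(value_below, value_above))
```

For instance, with hypothetical rank values of 0.9 for Patent A and 0.5 for Patent C, calling bound_estimate(0.7, [(0.9, 8_000_000), (0.5, 10_000_000)]) yields bounds of $8,000,000 and $10,000,000 for Patent B.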

In one case, if dollar values associated with multiple documents in a particular module are known, the dollar values may be used to determine a dollar value profile (e.g., an exponential, linear, or other type of curve) as a function of rank value. For instance, the profile may indicate an estimated relationship between dollar value and rank value. Given the rank value of a particular document, the profile may then be used to infer a dollar value of the particular document by fitting the rank value to the profile.
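
By way of example only, an exponential dollar value profile might be fitted as sketched below. The sketch assumes a profile of the form value = a * exp(b * rank) and fits it by ordinary least squares on the logarithm of the known dollar values; the function names and the choice of an exponential form are assumptions made for illustration, and any other curve or fitting technique could be substituted.

```python
import math

def fit_exponential_profile(points):
    """Fit value = a * exp(b * rank) to (rank_value, dollar_value) pairs.

    Uses least squares on log(dollar_value); requires at least two distinct
    rank values and strictly positive dollar values.
    """
    n = len(points)
    xs = [rank for rank, _ in points]
    ys = [math.log(value) for _, value in points]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b

def infer_value(rank, a, b):
    """Infer a dollar value for a document from its rank value and the profile."""
    return a * math.exp(b * rank)
```

Given the fitted parameters, infer_value may be called with the rank value of any document in the module to obtain an inferred dollar value.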

Generally, any type of regression method may be used to determine an objective measurement of importance, influence, strength, or monetary value based on known objective measurements and rank values associated with documents. Existing techniques for patent valuation or patent portfolio valuation may also be used to determine metadata for association with one or more patent documents.

Because the rank value is an objectively determined metric, other methods are also contemplated for determining a dollar value associated with a particular document based on a rank value of the document. Further, a dollar value of a group of documents may also be determined by summing the dollar value determined for each individual document of the group.

In other instances, metadata may also be used to constrain the configuration of a network of nodes. For instance, a network may be constrained to only include patent documents that have associated metadata indicating that a patent was held to be valid. As another example, a network of nodes may be filtered, based on metadata associated with the nodes, to only include patent documents that were published or filed in the year 2011. Other examples of filtering or constraining a network may also exist.
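
As one possible sketch of such filtering, the Python function below keeps only the nodes whose metadata satisfies a caller-supplied condition and drops any directed edge that no longer connects two retained nodes. The function name filter_network, the dictionary-based metadata layout, and the metadata keys shown in the usage note are hypothetical.

```python
def filter_network(nodes, edges, keep):
    """Constrain a network to nodes whose metadata satisfies `keep`.

    `nodes` maps a document identifier to a metadata dictionary, `edges` is a
    list of (citing, cited) pairs, and `keep` is a predicate over metadata.
    Edges are retained only when both endpoints survive the filter.
    """
    kept_nodes = {doc: meta for doc, meta in nodes.items() if keep(meta)}
    kept_edges = [(src, dst) for src, dst in edges
                  if src in kept_nodes and dst in kept_nodes]
    return kept_nodes, kept_edges
```

For instance, filter_network(nodes, edges, lambda meta: meta.get("year") == 2011) would retain only documents whose hypothetical "year" metadata is 2011.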

f. Identifying Technology Gaps

In some examples, the method 100 may be used to identify technology gaps or “white spaces” where little inventive action is occurring. For instance, if the corpus of documents includes patent documents, a total-connection-number for one, multiple, or all of the plurality of nested modules may be determined. The total-connection-number may be a total of one or more of: (a) a number of references from patent documents associated with a module to other patent documents associated with the module and (b) a number of references to patent documents associated with the module from other patent documents not associated with the module. Given the total-connection-number of one or more modules, one or more unconnected modules may be identified by comparing the total-connection-number to a total-connection-number threshold. For instance, if a total-connection-number of a particular module is below a threshold, the module may be identified as an unconnected module. Such an unconnected module may represent a module from which technological features of patent documents within the module may be combined with other technological features associated with patent documents outside of the module to develop new technological innovations.
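
The following Python sketch illustrates one way the total-connection-number and the threshold comparison might be computed. It assumes that both component (a) and component (b) are summed, that each document belongs to exactly one module at the level under consideration, and that references are available as (citing, cited) pairs; the function names and data layout are hypothetical.

```python
from collections import Counter

def total_connection_numbers(module_of, references):
    """Return a total-connection-number for each module.

    `module_of` maps a document identifier to its module, and `references`
    is a list of (citing, cited) document pairs.
    """
    internal = Counter()  # (a) references between documents of the same module
    incoming = Counter()  # (b) references into the module from other modules
    for citing, cited in references:
        src, dst = module_of[citing], module_of[cited]
        if src == dst:
            internal[dst] += 1
        else:
            incoming[dst] += 1
    return {m: internal[m] + incoming[m] for m in set(module_of.values())}

def unconnected_modules(totals, threshold):
    """Identify modules whose total-connection-number is below the threshold."""
    return sorted(m for m, total in totals.items() if total < threshold)
```

The threshold value is left to the caller, since the disclosure does not prescribe a particular total-connection-number threshold.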

In another example, fuzzy partitioning may be used to identify technology gaps. Fuzzy partitioning may refer to hierarchical partitioning in which modules may overlap. For instance, nodes of a network may belong to two or more overlapping modules at the same level. If a network is hierarchically partitioned using fuzzy partitioning, it may be possible to identify, from the resulting partition, any modules that do not overlap with other modules. For instance, there may be modules that have nodes which are only associated with a single module at one level. Such modules, that do not overlap with other modules at the same level, may be identified as modules where technological features might be combined with technological features from other modules to yield new and interesting developments.
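
As a minimal sketch, assuming the fuzzy partition at a given level is available as a mapping from each module to the set of nodes it contains, modules that do not overlap any other module at that level might be identified as follows. The function name and the data layout are hypothetical.

```python
def non_overlapping_modules(members):
    """Return modules that share no node with any other module at the level.

    `members` maps a module name to the set of node identifiers assigned to
    it; in a fuzzy partition a node may appear in more than one set.
    """
    result = []
    for name, nodes in members.items():
        others = set().union(
            *(other_nodes for other, other_nodes in members.items() if other != name)
        )
        if nodes.isdisjoint(others):
            result.append(name)
    return sorted(result)
```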

Other examples for identifying technology gaps may also exist, and the described example is not meant to be limiting. Generally, analyzing the structure of a plurality of nested modules may be useful to identify technology areas that are relatively unconnected and which may benefit from combining technological features from another area.

III. SECOND EXAMPLE METHOD

FIG. 10 is a flow chart of an example method 1000 for identifying a change in network structure over time. In one example, the method 1000 may be used to identify locations within a hierarchically classified network where activity is relatively high or low. For instance, if a hierarchical partition of a corpus of patent documents has been determined, the method 1000 may be used to identify locations (e.g., modules) within the corpus of patent documents where inventive activity is relatively high or low based on rates or trends in references to and/or from the module over time. The method 1000 may be performed by a computing device, such as the computing device 1402 of FIG. 14. In addition, one or more blocks of the method 1000 may be combined with one or more blocks of the method 100 of FIG. 1. Further details of the example method 1000 are now described.

At block 1002, the method 1000 includes receiving first partition data indicating a first hierarchical partition of a first corpus of documents associated with a first time period. The first hierarchical partition may define a first plurality of nested modules, and each module in the first plurality of nested modules may be associated with one or more respective documents within the first corpus of documents.

In one example, the first hierarchical partition may have been determined using the method 100 of FIG. 1. The first partition data received at block 1002 may therefore indicate a hierarchy of the nested modules and information regarding which documents are associated with which modules. The first plurality of nested modules may also be labeled based on semantic information from respective documents of each module of the first plurality. The first corpus of documents may include any type of documents that have references between them. For example, the first corpus of documents may be patent documents, scholarly publications, or court case documents, among other possibilities. The first time period may be a date range such as a range of years (e.g., 1995-2000), range of months (e.g., January 1999-December 1999), or other time range.

At block 1004, the method 1000 includes receiving second partition data indicating a second hierarchical partition of a second corpus of documents associated with a second time period. The second hierarchical partition may define a second plurality of nested modules, and each module in the second plurality of nested modules may be associated with one or more respective documents within the second corpus of documents.

The second partition data received at block 1004 may have some temporal relationship to the first partition data received at block 1002. For instance, the second corpus of documents may include the first corpus of documents and a plurality of additional documents that were published after the first time period. As an example, the first corpus of documents may be patent documents from 1979-1999 and the second corpus of documents may be patent documents from 1979-2009. As with the first partition data received at block 1002, the second partition data received at block 1004 may also have been determined using the method 100 of FIG. 1.

Thus, in one example, the first partition data received at block 1002 may represent a hierarchical network structure of a first corpus of documents at a first instance in time while the second partition data received at block 1004 may represent a hierarchical network structure of a second corpus of documents at a second instance in time that is later than the first instance in time.

At block 1006, the method 1000 includes comparing, to a threshold, a difference between a number of references to documents within a particular module of the first plurality of nested modules and a number of references to documents within a corresponding module of the second plurality of nested modules. In one example, the comparison at block 1006 may reveal a level of change associated with the particular module. For instance, FIG. 11 is a flow diagram of an example approach 1100 for determining a level of change in network structure over time.

Initially, first partition data 1102 and second partition data 1104 may be received. The first partition data 1102 may be the partition data received at block 1002 and the second partition data 1104 may be the partition data received at block 1004. Subsequently, at block 1106, a particular module may be identified. For instance, the particular module may be identified based on document-selection data indicating a selection of a particular module. Based on the identified module, at block 1108, a reference parameter associated with the particular module may be determined. For the example method 1000, the reference parameter may be a total number of references to documents within the particular module during the first time period.

Similarly, at block 1110, a module corresponding to the particular module may be identified within the second partition data 1104. For example, the corresponding module may be a module in the second partition data 1104 that has the same label as the particular module identified at block 1106. Subsequently, at block 1112, a total number of references to documents within the corresponding module during the second time period may be determined.

The determined numbers of references may be compared at block 1114 to determine a difference. In one example, the difference may be determined by subtracting the determined numbers of references. In another example, the difference may be a percentage increase or decrease. At blocks 1116 and 1118, the difference may be compared to a first threshold and, if necessary, a second threshold to qualify the difference as, for example, high, medium, or low. Note that the comparison described with respect to FIG. 11 is one example approach, and is not meant to be limiting.
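
The comparison of blocks 1114 through 1118 might be sketched in Python as shown below. The sketch assumes that the first threshold is the larger of the two thresholds and that the magnitude of the difference is what is qualified; the function name, parameter names, and these two conventions are assumptions made for illustration only.

```python
def qualify_change(refs_first, refs_second, first_threshold,
                   second_threshold, relative=False):
    """Compute the change in reference counts and qualify its magnitude.

    With relative=False the difference is a plain subtraction; with
    relative=True it is a percentage change, which requires refs_first > 0.
    Returns (difference, level) where level is "high", "medium", or "low".
    """
    difference = refs_second - refs_first
    if relative:
        difference = 100.0 * difference / refs_first
    magnitude = abs(difference)
    if magnitude >= first_threshold:
        level = "high"
    elif magnitude >= second_threshold:
        # The second threshold is consulted only if the first is not met.
        level = "medium"
    else:
        level = "low"
    return difference, level
```

The sign of the returned difference indicates whether the number of references increased or decreased between the first time period and the second time period.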

Returning to the example method 1000 of FIG. 10, at block 1008, the method 1000 includes, based on the comparison, causing a graphical display to provide a visual indication of the difference and the particular module. In one example, the graphical display may list an indication of whether the difference was an increase or decrease as well as the label of the particular module, and optionally the corresponding module. In another example, the graphical display may list an indication of whether the difference was high, medium, or low.

In a further example, the method 1000 may be repeated for multiple modules of the received partition data. Additionally, the graphical display may be caused to generally provide a visual indication of changes in a network structure of the first hierarchical partition of the first corpus of documents and changes in a network structure of the second hierarchical partition of the second corpus of documents over time. For instance, the visual indication of changes in network structure of the first hierarchical partition and the network structure of the second hierarchical partition may be provided in the form of an alluvial diagram.

FIG. 12 is a conceptual illustration of an alluvial diagram 1200. In the alluvial diagram 1200, four hierarchical partitions corresponding to 2001, 2003, 2005, and 2007, are shown. For the example of method 1000, the first hierarchical partition may be represented by the 2001 column while the second hierarchical partition may be represented by the 2003 column. In the alluvial diagram 1200, the horizontal blocks correspond to modules and are ordered by height. The height of each block corresponds to reference flow through documents of the corresponding module. The reference flow may include one or a combination of references from other documents to documents in the module and references from documents in the module to other documents. The alluvial diagram 1200 also illustrates how submodules within a module can split or merge with other submodules/modules over time.

In a further extension of the method 1000, changes in sources of references to documents of a particular module over time may be identified. For instance, for a first hierarchical partition corresponding to a first time period, a source-summary of the sources of any references to documents within a particular module may be determined. The source-summary may include a list of any other modules that had documents refer to documents within the particular module as well as the total number of references from each module of the list. Similarly, a source-summary of the sources of any references to documents within a corresponding module for a second hierarchical partition associated with a second time period may be determined. The source-summary associated with the first hierarchical partition and the source-summary associated with the second hierarchical partition may then be compared to identify if the source of documents that are referencing documents of the particular module is changing over time.
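
One way a source-summary might be built and compared is sketched below in Python. The sketch assumes that each document belongs to one module at the level of interest and that references are available as (citing, cited) pairs; the function names and data layout are hypothetical.

```python
from collections import Counter

def source_summary(module, module_of, references):
    """Count references into `module`, grouped by the citing document's module.

    Only references originating from other modules are counted.
    """
    summary = Counter()
    for citing, cited in references:
        source = module_of[citing]
        if module_of[cited] == module and source != module:
            summary[source] += 1
    return summary

def compare_summaries(first, second):
    """Return the per-source change in reference counts between two summaries.

    Sources whose counts did not change are omitted from the result.
    """
    sources = set(first) | set(second)
    return {s: second[s] - first[s] for s in sources if second[s] != first[s]}
```

A non-empty result from compare_summaries suggests that the sources of references to the particular module differ between the two time periods.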

In another example, the method 1000 may be used to identify rapidly-developing areas of technology. For instance, if documents of the first corpus of documents and the second corpus of documents each comprise patent documents, the method 1000 may be used to identify rapidly-developing areas of patent activity. In one instance, the rapidly-developing areas may be identified based on a difference between (a) a number of references from patent documents within a particular module of the first plurality of nested modules to other patent documents within the particular module and (b) a number of references from patent documents within a corresponding module of the second plurality of nested modules to other patent documents within the particular module. If, for example, the difference in numbers of references is greater than a change-threshold, the module may be identified as a rapidly-developing area.

In still another example, the method 1000 may be used to identify newly-formed modules. For instance, if the first corpus of documents and the second corpus of documents each comprise patent documents, a module of the second plurality of nested modules may be identified as a newly-formed module if (a) the module is not a module within the first plurality of nested modules and (b) the module combines patent documents associated with two or more modules of the first plurality of nested modules.
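
The two conditions above might be checked as in the following Python sketch. The sketch assumes that modules are compared across partitions by label, so that condition (a) is treated as the label of the module being absent from the first partition; that interpretation, the function name, and the data layout are assumptions made for illustration.

```python
def newly_formed_modules(first, second):
    """Identify newly-formed modules in the second partition.

    `first` and `second` each map a module label to the set of documents in
    that module. A module of `second` is reported if (a) its label does not
    appear in `first` and (b) its documents come from two or more modules of
    `first`.
    """
    first_module_of = {doc: label for label, docs in first.items() for doc in docs}
    result = []
    for label, docs in second.items():
        if label in first:
            continue
        origins = {first_module_of[doc] for doc in docs if doc in first_module_of}
        if len(origins) >= 2:
            result.append(label)
    return sorted(result)
```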

IV. EXAMPLE DOCUMENT REPORT

In one example, a document report that includes information about how a particular document is hierarchically classified, labeled, and/or ranked may be provided. For instance, if documents of a corpus of documents include patent documents, a patent document report may be provided. FIG. 13 is a conceptual illustration of an example report 1300. As shown in FIG. 13, the example report 1300 may include one or more of a patent number, filing date, issue date, inventor(s), owner/assignee, and licensee(s). Additionally, the example report 1300 may list information about patents from which the patent document claims priority and/or patents claiming priority from the patent. Further, the example report 1300 may list a rank value, time-normalized rank value, hierarchically classified category, and patent office classification. Also, the example report may list other highly ranked patent documents within the hierarchically classified category to which the patent document belongs. In addition, a document report may include metadata information associated with documents, such as any of the examples of metadata described herein or other types of metadata.

The example report 1300 is provided as an example, and is not meant to be limiting. Other reports including other types of information or configurations may also exist.

V. EXAMPLE SYSTEM

FIG. 14 is a simplified block diagram of an example computing device 1402. It should be understood that this and other arrangements described herein are set forth only as examples. Those skilled in the art will appreciate that other arrangements and elements (e.g., machines, interfaces, functions, orders, and groupings of functions, etc.) can be used instead and that some elements may be omitted altogether. Further, many of the elements described herein are functional entities that may be implemented as discrete or distributed components in conjunction with other components and in any suitable combination and location. Various functions described herein as being performed by one or more entities may be carried out by hardware, firmware, and/or software. For instance, various functions may be carried out by a processor executing instructions stored in memory.

The computing device 1402 may be configured to carry out the functions described herein. For example, the computing device 1402 may be configured to carry out any of the functions described with respect to the example methods of FIGS. 1 and 10 described above. As one example, the computing device 1402 may be configured to determine a hierarchical partition of a corpus of documents. As another example, the computing device 1402 may be configured to identify a change in network structure over time.

As shown, computing device 1402 may include, without limitation, a communication interface 1404, processor 1406, and data storage 1408, all of which may be communicatively linked together by a system bus, network, and/or other connection mechanism 1414.

Communication interface 1404 typically functions to communicatively couple computing device 1402 to other devices and/or entities. As such, communication interface 1404 may include a wired (e.g., Ethernet, without limitation) and/or wireless (e.g., CDMA and/or Wi-Fi, without limitation) communication interface, for communicating with other devices and/or entities. Communication interface 1404 may also include multiple interfaces, such as one through which computing device 1402 sends communication, and one through which computing device 1402 receives communication. Communication interface 1404 may be arranged to communicate according to one or more types of communication protocols mentioned herein and/or any others now known or later developed.

Processor 1406 may include one or more general-purpose processors (such as INTEL processors or the like) and/or one or more special-purpose processors (such as digital-signal processors or application-specific integrated circuits). To the extent processor 1406 includes more than one processor, such processors could work separately or in combination. Further, processor 1406 may be integrated in whole or in part with communication interface 1404 and/or with other components.

Data storage 1408, in turn, may include one or more volatile and/or non-volatile storage components, such as magnetic, optical, or organic memory components. As shown, data storage 1408 may include program data 1410 and program logic 1412 executable by processor 1406 to carry out various functions described herein. Although these components are described herein as separate data storage elements, the elements could just as well be physically integrated together or distributed in various other ways. For example, program data 1410 may be maintained in data storage 1408 separate from program logic 1412, for easy updating and reference by program logic 1412.

Program data 1410 may include various data used by computing device 1402 in operation. As an example, program data 1410 may include information pertaining to biomedical image data and/or pharmacokinetic models. Similarly, program logic 1412 may include any additional program data, code, or instructions necessary to carry out the functions described herein. For example, program logic 1412 may include instructions executable by processor 1406 for causing computing device 1402 to carry out any of those functions described herein.

VI. EXAMPLE COMPUTER READABLE MEDIUM

In some embodiments, the disclosed methods may be implemented by computer program logic, or instructions, encoded on a physical and/or non-transitory computer-readable storage medium in a machine-readable format, or on other physical and/or non-transitory media or articles of manufacture. FIG. 15 is a schematic illustrating a conceptual partial view of an example computer program product that includes a computer program for executing a computer process on a computing device, arranged according to at least some embodiments presented herein.

In one embodiment, the example computer program product 1500 is provided using a signal bearing medium 1502. The signal bearing medium 1502 may include one or more programming instructions 1504 that, when executed by one or more processors, may provide functionality or portions of the functionality described herein. In some examples, the signal bearing medium 1502 may encompass a computer-readable medium 1506, such as, but not limited to, a hard disk drive, a Compact Disc (CD), a Digital Video Disk (DVD), a digital tape, memory, etc. In some implementations, the signal bearing medium 1502 may encompass a computer recordable medium 1508, such as, but not limited to, memory, read/write (R/W) CDs, R/W DVDs, etc. In some implementations, the signal bearing medium 1502 may encompass a communications medium 1510, such as, but not limited to, a digital and/or an analog communication medium (e.g., a fiber optic cable, a waveguide, a wired communications link, a wireless communication link, etc.). Thus, for example, the signal bearing medium 1502 may be conveyed by a wireless form of the communications medium 1510. It should be understood, however, that computer-readable medium 1506, computer recordable medium 1508, and communications medium 1510 as contemplated herein are distinct mediums and that, in any event, computer-readable medium 1506 is a physical, non-transitory, computer-readable medium.

The one or more programming instructions 1504 may be, for example, computer executable and/or logic implemented instructions. In some examples, a computing device such as that shown in FIG. 14 may be configured to provide various operations, functions, or actions in response to the programming instructions 1504 conveyed to the computing device by one or more of the computer readable medium 1506, the computer recordable medium 1508, and/or the communications medium 1510.

The physical and/or non-transitory computer readable medium could also be distributed among multiple data storage elements, which could be remotely located from each other. The computing device that executes some or all of the stored instructions could be a computing device, such as the computing device illustrated in FIG. 14. Alternatively, the computing device that executes some or all of the stored instructions could be another computing device.

VII. CONCLUSION

It is intended that the foregoing detailed description be regarded as illustrative rather than limiting and that it is understood that the following claims including all equivalents are intended to define the scope of the invention. The claims should not be read as limited to the described order or elements unless stated to that effect. Therefore, all embodiments that come within the scope and spirit of the following claims and equivalents thereto are claimed as the invention.

1. A computer-implemented method, comprising: receiving document data indicating (i) a corpus of documents and (ii) references between documents within the corpus of documents; determining a network comprising (i) two or more nodes, wherein each node corresponds to a respective document in the corpus of documents, and (ii) at least one directed edge, wherein each directed edge connects two respective nodes, and wherein each directed edge corresponds to a reference between two documents in the corpus of documents; based on the directed edges of the network, determining a hierarchical partition of the documents, wherein the hierarchical partition defines a plurality of nested modules, and wherein each module in the plurality of nested modules is associated with one or more respective documents within the corpus of documents; and causing a graphical display to provide a visual indication of one or more of the plurality of nested modules.
2. The computer-implemented method of claim 1, wherein each reference between documents within the corpus of documents is time-directed, and wherein the determined network is in a form of an acyclic directed graph.
3. The computer-implemented method of claim 1, further comprising determining a rank value associated with each respective document of the corpus based on the references between documents of the corpus of documents, wherein the rank value associated with each respective document is determined based at least in part on a respective number of references to the respective document and the rank value associated with each of one or more documents referring to the respective document.
4. The computer-implemented method of claim 3, further comprising: receiving document-selection data indicating a selection of a particular document; and causing the graphical display to provide a visual indication of a respective value associated with the particular document.
5. The computer-implemented method of claim 3, further comprising: receiving document-selection data indicating a selection of a group of documents; determining a total rank value associated with the group of documents, wherein determining the total rank value comprises summing the rank value associated with each respective document of the group of documents; and causing the graphical display to provide a visual indication of the total rank value associated with the group of documents.
6. The computer-implemented method of claim 3, further comprising: receiving document-selection data indicating an identification of a particular category; identifying one or more documents associated with the particular category; determining an ordered list of the one or more identified documents based on the rank value associated with each of the one or more identified documents; and causing the graphical display to provide a visual indication of the determined ordered list of the one or more identified documents.
7. The computer-implemented method of claim 3, further comprising: receiving document-selection data indicating a selection of a particular module of the plurality of nested modules; determining an ordered list of one or more documents within the module based on the rank value associated with each of the one or more documents within the module; and causing a graphical display to provide a visual indication of the determined ordered list of one or more documents within the module.
8. The computer-implemented method of claim 3, further comprising determining a monetary value associated with each of one or more respective documents of the corpus of documents based on the rank value associated with the document.
9. The computer-implemented method of claim 1, further comprising: receiving document-selection data indicating a selection of a particular document; and causing the graphical display to provide a visual indication of a module comprising the particular document.
10. The computer-implemented method of claim 1, further comprising: receiving document-selection data indicating a selection of a particular module of the plurality of nested modules; and causing the graphical display to provide a visual indication of one or more submodules associated with the particular module.
11. The computer-implemented method of claim 10, wherein a respective size of each of the one or more submodules is proportional to a total respective rank value of documents within the submodule.
12. The computer-implemented method of claim 1, wherein a respective size of each of the one or more modules in the visual indication of the one or more of the plurality of nested modules is proportional to a total respective rank value of documents within the module.
13. The computer-implemented method of claim 1, further comprising: determining a label for each of one or more respective modules of the plurality of nested modules based on mutual content within one or more documents of the module; and causing the graphical display to provide a visual indication of the label for one or more of the plurality of nested modules.
14. The computer-implemented method of claim 1, wherein determining a hierarchical partition of the documents comprises determining a hierarchical partition that minimizes a hierarchical map equation, wherein the hierarchical map equation quantifies an average description length associated with modeling a process of flow on the network.
15. The computer-implemented method of claim 1, wherein the received document data comprises document data associated with a first time period, and wherein the method further comprises: receiving partition data indicating a hierarchical partition of another corpus of documents associated with a second time period, wherein the hierarchical partition defines another plurality of nested modules, and wherein each module in the another plurality of nested modules is associated with one or more respective documents within the another corpus of documents; comparing (i) a difference between (a) a number of references to documents within a particular module associated with the first time period and (b) a number of references to documents within a corresponding module associated with the second time period to (ii) a threshold; and based on the comparison, causing a graphical display to provide a visual indication of the difference and the particular module.
 16. The computer-implemented method of claim 1, wherein documents of the corpus of documents comprise one or more of patent documents, scholarly documents, litigation documents, government documents, social media documents, online documents, magazine articles, and books.
 17. The computer-implemented method of claim 1, wherein documents of the corpus of documents comprise patent documents, and wherein each module of the plurality of nested modules comprises an objective classification of a technology area.
 18. The computer-implemented method of claim 1, wherein documents of the corpus of documents comprise patent documents, and the method further comprising: receiving document-selection data indicating a selection of a particular patent document; and identifying one or more documents within a module of the particular patent document having dates prior to a date associated with the particular patent document and rank values associated with other patent documents within a module of the particular patent document.
 19. A system comprising: at least one processor; a physical computer readable medium; and program instructions stored on the physical computer readable medium and executable by the at least one processor to: receive document data indicating (i) a corpus of documents and (ii) references between documents within the corpus of documents; determine a network comprising (i) two or more nodes, wherein each node corresponds to a respective document in the corpus of documents, and (ii) at least one directed edge, wherein each directed edge connects two respective nodes, and wherein each directed edge corresponds to a reference between two documents in the corpus of documents; and based on the directed edges of the network, determine a hierarchical partition of the documents, wherein the hierarchical partition defines a plurality of nested modules, and wherein each module in the plurality of nested modules is associated with one or more respective documents within the corpus of documents.
 20. A physical computer readable medium having instructions stored thereon, the instructions comprising: instructions for receiving document data indicating (i) a corpus of documents and (ii) references between documents within the corpus of documents; instructions for determining a network comprising (i) two or more nodes, wherein each node corresponds to a respective document in the corpus of documents, and (ii) at least one directed edge, wherein each directed edge connects two respective nodes, and wherein each directed edge corresponds to a reference between two documents in the corpus of documents; instructions for determining a hierarchical partition of the documents based on the directed edges of the network, wherein the hierarchical partition defines a plurality of nested modules, and wherein each module in the plurality of nested modules is associated with one or more respective documents within the corpus of documents; and instructions for causing a graphical display to provide a visual indication of one or more of the plurality of nested modules.